NSF PAR Search | NSF Public Access Repository


Atos: A Task-Parallel GPU Scheduler for Graph Analytics

Chen, Yuxin; Brock, Benjamin; Porumbescu, Serban; Buluç, Aydın; Yelick, Katherine; Owens, John D. (August 2022, Proceedings of the International Conference on Parallel Processing)

Full Text Available
Asynchrony versus bulk-synchrony for a generalized N-body problem from genomics

https://doi.org/10.1145/3437801.3441580

Ellis, Marquita; Buluç, Aydın; Yelick, Katherine (February 2021, PPoPP '21: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming)
Full Text Available
Distributed-memory parallel algorithms for sparse times tall-skinny-dense matrix multiplication

https://doi.org/10.1145/3447818.3461472

Selvitopi, Oguz; Brock, Benjamin; Nisa, Israt; Tripathy, Alok; Yelick, Katherine; Buluç, Aydın (June 2021, ICS '21: Proceedings of the ACM International Conference on Supercomputing)
Full Text Available
GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU

https://doi.org/10.1145/3466795

Yang, Carl; Buluç, Aydın; Owens, John D. (January 2021, ACM Transactions on Mathematical Software)
High-performance implementations of graph algorithms are challenging to implement on new parallel hardware such as GPUs because of three challenges: (1) the difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic intensity. To address some of these challenges, GraphBLAS is an innovative, on-going effort by the graph analytics community to propose building blocks based on sparse linear algebra, which allow graph algorithms to be expressed in a performant, succinct, composable, and portable manner. In this paper, we examine the performance challenges of a linear-algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push and pull direction. Exploiting output sparsity allows users to tell the backend which values of the output in a single vectorized computation they do not want computed. Load-balancing is an important feature for balancing work amongst parallel workers. We describe the important load-balancing features for handling graphs with different characteristics. The design principles described in this paper have been implemented in “GraphBLAST”, the first high-performance linear algebra-based graph framework on NVIDIA GPUs that is open-source. The results show that on a single GPU, GraphBLAST has on average at least an order of magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL, comparable performance to the fastest GPU hardwired primitives and shared-memory graph frameworks Ligra and Gunrock, and better performance than any other GPU graph framework, while offering a simpler and more concise programming model.
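To make the two sparsity ideas in this abstract concrete, here is a minimal CPU-only sketch in plain Python. It is not GraphBLAST code and does not use the GraphBLAS API. It treats one breadth-first-search step as a masked vector-times-matrix product. The function names (`masked_step`, `bfs_depths`) and the `threshold` parameter are hypothetical, chosen only for illustration. The step picks a push or pull traversal from the frontier's sparsity (input sparsity) and never computes outputs for already-visited vertices (output sparsity, expressed as a mask).

```python
def masked_step(frontier, adj, radj, visited, threshold):
    """One BFS step: next = frontier * A, masked by the complement of `visited`."""
    n = len(adj)
    nxt = set()
    if len(frontier) < threshold * n:
        # Sparse input: "push" along out-edges of frontier vertices only.
        for u in frontier:
            for v in adj[u]:
                if v not in visited:          # mask: skip unwanted outputs
                    nxt.add(v)
    else:
        # Dense input: "pull", where only unvisited vertices inspect in-edges.
        for v in adj:
            if v not in visited and any(u in frontier for u in radj[v]):
                nxt.add(v)
    return nxt


def bfs_depths(adj, source, threshold=0.5):
    """Return {vertex: depth}. `adj` maps every vertex to its out-neighbours."""
    # Build the transposed adjacency once so the pull direction is available.
    radj = {v: [] for v in adj}
    for u, nbrs in adj.items():
        for v in nbrs:
            radj[v].append(u)

    depth = {source: 0}
    frontier = {source}
    level = 0
    while frontier:
        level += 1
        frontier = masked_step(frontier, adj, radj, depth.keys(), threshold)
        for v in frontier:
            depth[v] = level
    return depth


graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
print(sorted(bfs_depths(graph, 0).items()))
# [(0, 0), (1, 1), (2, 1), (3, 2), (4, 3)]
```

The caller never names a direction: the same `bfs_depths` call yields identical depths whether every step pushes, every step pulls, or the steps switch between the two.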
Full Text Available
BCL: A Cross-Platform Distributed Data Structures Library

https://doi.org/10.1145/3337821.3337912

Brock, Benjamin; Buluç, Aydın; Yelick, Katherine (August 2019, Proceedings of the 48th International Conference on Parallel Processing)

One-sided communication is a useful paradigm for irregular parallel applications, but most one-sided programming environments, including MPI’s one-sided interface and PGAS programming languages, lack application-level libraries to support these applications. We present the Berkeley Container Library, a set of generic, cross-platform, high-performance data structures for irregular applications, including queues, hash tables, Bloom filters and more. BCL is written in C++ using an internal DSL called the BCL Core that provides one-sided communication primitives such as remote get and remote put operations. The BCL Core has backends for MPI, OpenSHMEM, GASNet-EX, and UPC++, allowing BCL data structures to be used natively in programs written using any of these programming environments. Along with our internal DSL, we present the BCL ObjectContainer abstraction, which allows BCL data structures to transparently serialize complex data types while maintaining efficiency for primitive types. We also introduce the set of BCL data structures and evaluate their performance across a number of high-performance computing systems, demonstrating that BCL programs are competitive with hand-optimized code, even while hiding many of the underlying details of message aggregation, serialization, and synchronization.
Full Text Available
Graph Coloring on the GPU

Osama, Muhammad; Truong, Minh; Yang, Carl; Buluç, Aydın; Owens, John D. (May 2019, Proceedings of the Workshop on Graphs, Architectures, Programming, and Learning)

We design and implement parallel graph coloring algorithms on the GPU using two different abstractions—one data-centric (Gunrock), the other linear-algebra-based (GraphBLAS). We analyze the impact of variations of a baseline independent-set algorithm on quality and runtime. We study how optimizations such as hashing, avoiding atomics, and a max-min heuristic affect performance. Our Gunrock graph coloring implementation has a peak 2x speed-up, a geomean speed-up of 1.3x and produces 1.6x more colors over previous hardwired state-of-the-art implementations on real-world datasets. Our GraphBLAS implementation of Luby's algorithm produces 1.9x fewer colors than the previous state-of-the-art parallel implementation at the cost of 3x extra runtime, and 1.014x fewer colors than a greedy, sequential algorithm with a geomean speed-up of 2.6x.
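As a rough illustration of what an independent-set coloring algorithm does, the sketch below runs sequentially in plain Python. It is not the Gunrock or GraphBLAS implementation from the paper. It is written in the spirit of Luby-style approaches, with one simplification stated plainly: random priorities are drawn once rather than redrawn every round. The function name `independent_set_coloring` is hypothetical, and the graph is assumed to be undirected (symmetric adjacency lists) with no self-loops.

```python
import random


def independent_set_coloring(adj, seed=0):
    """Color an undirected graph; returns {vertex: color index}.

    Assumes `adj` is symmetric and contains no self-loops.
    """
    rng = random.Random(seed)
    # Random priority per vertex; the vertex id breaks ties deterministically.
    priority = {v: (rng.random(), v) for v in adj}
    color = {}
    current = 0
    while len(color) < len(adj):
        # An uncolored vertex joins this round's set if it outranks every
        # uncolored neighbour. Two adjacent vertices cannot both qualify,
        # so the chosen set is independent.
        chosen = [
            v for v in adj
            if v not in color
            and all(u in color or priority[u] < priority[v] for u in adj[v])
        ]
        # On a GPU, each vertex in `chosen` could be colored in parallel.
        for v in chosen:
            color[v] = current
        current += 1
    return color


# A 5-cycle with one chord (0-2).
graph = {0: [1, 4, 2], 1: [0, 2], 2: [1, 3, 0], 3: [2, 4], 4: [3, 0]}
coloring = independent_set_coloring(graph)
print(all(coloring[u] != coloring[v] for u in graph for v in graph[u]))  # True
```

Each round colors at least the highest-priority uncolored vertex, so the loop terminates. The number of rounds equals the number of colors used, which is why heuristics that enlarge each independent set affect both runtime and color count.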
Full Text Available
The parallelism motifs of genomic data analysis

https://doi.org/10.1098/rsta.2019.0394

Yelick, Katherine; Buluç, Aydın; Awan, Muaaz; Azad, Ariful; Brock, Benjamin; Egan, Rob; Ekanayake, Saliya; Ellis, Marquita; Georganas, Evangelos; Guidi, Giulia; et al (March 2020, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences)

Genomic datasets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share these data with the research community, but some of these genomic data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high-end parallel systems today and place different requirements on programming support, software libraries and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high-performance genomics analysis, including alignment, profiling, clustering and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or ‘motifs’ that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
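The hashing pattern mentioned in this abstract can be sketched with k-mer counting, a common step in genomic pipelines. The example below is a single-process simulation, not code from the article. The "ranks" are ordinary Python counters, and the names `owner` and `count_kmers` are hypothetical. Each k-mer is routed to an owner chosen by a hash of its sequence, so all updates for a given k-mer land in one partition. In a real distributed run, those updates would be the irregular, asynchronous writes to shared data structures that the abstract describes.

```python
import zlib
from collections import Counter


def owner(kmer, num_ranks):
    """Map a k-mer to the rank that owns it (CRC32 is stable across runs)."""
    return zlib.crc32(kmer.encode("ascii")) % num_ranks


def count_kmers(reads, k, num_ranks):
    """Return one Counter per simulated rank holding the k-mers it owns."""
    tables = [Counter() for _ in range(num_ranks)]
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # In a distributed setting this would be a remote update.
            tables[owner(kmer, num_ranks)][kmer] += 1
    return tables


tables = count_kmers(["ACGTAC", "GTACGT"], k=3, num_ranks=4)
total = sum(tables, Counter())
print(total["ACG"], total["GTA"])  # 2 2
```

Because ownership depends only on the k-mer itself, no two partitions ever hold counts for the same k-mer, and the global result is simply the union of the per-rank tables.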
Full Text Available
